NSF PAR Search | NSF Public Access Repository

Federated Black-box Prompt Tuning System for Large Language Models on the Edge

https://doi.org/10.1145/3636534.3698856

Li, Yiming; Sun, Jingwei; Liu, Yudong; Zhang, Yuandong; Li, Ang; Chen, Beidi; Roth, Holger R; Xu, Daguang; Chen, Tingjun; Chen, Yiran (December 2024, ACM)

Full Text Available
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Zhao, Jiawei; Zhang, Zhenyu; Chen, Beidi; Wang, Zhangyang; Anandkumar, Anima; Tian, Yuandong (July 2024, International Conference on Machine Learning (ICML))

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
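The abstract attributes the savings to smaller optimizer states but does not spell out the arithmetic. The following is a rough, illustrative sketch rather than the authors' implementation. It assumes an Adam-style optimizer that keeps two moment buffers per weight matrix, and a rank-r projection that needs one m x r projection matrix plus two r x n moment buffers. The function names and the layer dimensions (4096 x 4096, rank 128) are hypothetical, and the resulting per-layer ratio is not meant to reproduce the 65.5% or 82.5% figures reported above, which are measured over whole models.

```python
def adam_state_elements(m: int, n: int) -> int:
    """Elements in Adam's two moment buffers for a full m x n weight matrix."""
    return 2 * m * n


def projected_state_elements(m: int, n: int, r: int) -> int:
    """Elements held when gradients are projected to rank r before the optimizer.

    Assumption: one m x r projection matrix plus two r x n moment buffers.
    """
    return m * r + 2 * r * n


# Hypothetical layer shape and rank, chosen only for illustration.
m, n, r = 4096, 4096, 128

full = adam_state_elements(m, n)               # 33,554,432 elements
projected = projected_state_elements(m, n, r)  # 1,572,864 elements
saving = 1 - projected / full                  # fraction of optimizer state avoided

print(f"full-rank optimizer state: {full:,} elements")
print(f"rank-{r} projected state:   {projected:,} elements")
print(f"per-layer reduction:       {saving:.1%}")
```

Under these assumptions the optimizer state scales with r(m + 2n) instead of 2mn, so the reduction for a single layer depends mainly on how small the rank is relative to the layer's dimensions.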
Full Text Available
Efficient Streaming Language Models with Attention Sinks

Xiao, Guangxuan; Tian, Yuandong; Chen, Beidi; Han, Song; Lewis, Mike (May 2024, The Twelfth International Conference on Learning Representations)

Full Text Available
Soft prompt recovers compressed LLMs, transferably

Xu, Zhaozhuo; Liu, Zirui; Chen, Beidi; Zhong, Shaochen; Tang, Yuxin; Wang, Jue; Zhou, Kaixiong; Hu, Xia; Shrivastava, Anshumali (July 2024, JMLR.org)

Full Text Available
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Liu, Zirui; Yuan, Jiayi; Jin, Hongye; Zhong, Shaochen; Xu, Zhaozhuo; Braverman, Vladimir; Chen, Beidi; Hu, Xia (June 2024, Proceedings of Machine Learning Research)

Full Text Available
Fast Algorithms for a New Relaxation of Optimal Transport

Charikar, Moses; Chen, Beidi; Re, Christopher; Waingarten, Erik (July 2023, Proceedings of Machine Learning Research)

Full Text Available
Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions

Massaroli, Stefano; Poli, Michael; Fu, Daniel Y.; Kumbong, Hermann; Parnichkun, Rom N.; Timalsina, Aman; Romero, David W.; McIntyre, Quinn; Chen, Beidi; Rudra, Atri; et al (December 2023, Proceedings of the 36th Neural Information Processing Systems Conference (NeurIPS))

Full Text Available
Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

Chen, Beidi; Dao, Tri; Liang, Kaizhao; Yang, Jiaming; Song, Zhao; Rudra, Atri; Re, Christopher (January 2022, International Conference on Learning Representations (ICLR))

Full Text Available
Monarch: Expressive Structured Matrices for Efficient and Accurate Training

Dao, Tri; Chen, Beidi; Sohoni, Nimit S.; Desai, Arjun; Poli, Michael; Grogan, Jessica; Liu, Alexander; Rao, Aniruddh; Rudra, Atri; Re, Christopher (January 2022, Proceedings of the 39th International Conference on Machine Learning)

Full Text Available
